##set.seed(1)
gliders<-read.csv("~/Desktop/IMOS_-_Australian_National_Facility_for_Ocean_Gliders_(ANFOG)_-_delayed_mode_glider_deployments.csv", skip = 41, header = TRUE)
Four parts have been completed so far. The first removes invalid data (lines 57-69). The second explores the dataset: the percentage of good and bad data for each variable, the number of missing values for each variable (lines 144-252), and the potential relationship between the response and each variable (lines 254, 318). The third calculates the correlation matrix and creates a correlation plot (lines 136-142). The fourth finds the response distribution (lines 72-134).
str(gliders)
## 'data.frame': 3123117 obs. of 58 variables:
## $ FID : chr "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d76" "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d75" "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d74" "anfog_dm_trajectory_data.fid-7f408395_174917ab189_-5d73" ...
## $ file_id : int 185 185 185 185 185 185 185 185 185 185 ...
## $ deployment_name : chr "TwoRocks20130215" "TwoRocks20130215" "TwoRocks20130215" "TwoRocks20130215" ...
## $ platform_type : chr "slocum glider" "slocum glider" "slocum glider" "slocum glider" ...
## $ platform_code : chr "SL248" "SL248" "SL248" "SL248" ...
## $ time_coverage_start : chr "2013-02-15T03:13:29Z" "2013-02-15T03:13:29Z" "2013-02-15T03:13:29Z" "2013-02-15T03:13:29Z" ...
## $ time_coverage_end : chr "2013-03-11T20:14:20Z" "2013-03-11T20:14:20Z" "2013-03-11T20:14:20Z" "2013-03-11T20:14:20Z" ...
## $ TIME : chr "2013-03-06T22:18:19Z" "2013-03-06T22:18:22Z" "2013-03-06T22:18:23Z" "2013-03-06T22:18:26Z" ...
## $ TIME_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num -31.8 -31.8 -31.8 -31.8 -31.8 ...
## $ LATITUDE_quality_control : int 8 8 8 8 8 8 8 8 8 8 ...
## $ LONGITUDE : num 115 115 115 115 115 ...
## $ LONGITUDE_quality_control: int 8 8 8 8 8 8 8 8 8 8 ...
## $ PRES : num 32.8 33.1 33.4 33.7 34 ...
## $ PRES_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DEPTH : num 32.5 32.8 33.1 33.5 33.8 ...
## $ DEPTH_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PROFILE : int 4907 4907 4907 4907 4907 4907 4907 4907 4907 4907 ...
## $ PROFILE_quality_control : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PHASE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PHASE_quality_control : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TEMP : num 23.8 23.8 23.8 23.8 23.8 ...
## $ TEMP_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PSAL : num 35.3 35.3 35.3 35.3 35.3 ...
## $ PSAL_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DOX1 : num 201 201 201 201 201 ...
## $ DOX1_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DOX2 : num 197 197 197 197 197 ...
## $ DOX2_quality_control : int 3 3 3 3 3 3 3 3 3 3 ...
## $ CPHL : num 0.255 0.258 0.263 0.262 0.258 ...
## $ CPHL_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ CDOM : num 0.889 0.846 0.845 0.955 0.801 ...
## $ CDOM_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ CNDC : num 5.23 5.23 5.23 5.23 5.23 ...
## $ CNDC_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ VBSC : num 4e-04 4e-04 5e-04 8e-04 5e-04 4e-04 4e-04 4e-04 4e-04 7e-04 ...
## $ VBSC_quality_control : int 1 1 1 1 1 1 1 1 1 1 ...
## $ NTRA : num NA NA NA NA NA NA NA NA NA NA ...
## $ NTRA_quality_control : int NA NA NA NA NA NA NA NA NA NA ...
## $ UCUR : num NA NA NA NA NA NA NA NA NA NA ...
## $ UCUR_quality_control : int 9 9 9 9 9 9 9 9 9 9 ...
## $ VCUR : num NA NA NA NA NA NA NA NA NA NA ...
## $ VCUR_quality_control : int 9 9 9 9 9 9 9 9 9 9 ...
## $ HEAD : num 300 300 302 302 304 ...
## $ HEAD_quality_control : int 0 0 0 0 0 0 0 0 0 0 ...
## $ UCUR_GPS : num NA NA NA NA NA NA NA NA NA NA ...
## $ UCUR_GPS_quality_control : int 9 9 9 9 9 9 9 9 9 9 ...
## $ VCUR_GPS : num NA NA NA NA NA NA NA NA NA NA ...
## $ VCUR_GPS_quality_control : int 9 9 9 9 9 9 9 9 9 9 ...
## $ IRRAD443 : num 0.272 0.264 0.258 0.258 0.258 ...
## $ IRRAD443_quality_control : int 4 4 4 4 4 4 4 4 4 4 ...
## $ IRRAD490 : num 0.354 0.346 0.34 0.338 0.338 ...
## $ IRRAD490_quality_control : int 4 4 4 4 4 4 4 4 4 4 ...
## $ IRRAD555 : num 0.0712 0.0675 0.0675 0.0666 0.0634 0.06 0.0609 0.0618 0.0609 0.0557 ...
## $ IRRAD555_quality_control : int 4 4 4 4 4 4 4 4 4 4 ...
## $ IRRAD670 : num 0.0114 0.0133 0.0128 0.0104 0.0104 0.0123 0.0114 0.0104 0.0109 0.0133 ...
## $ IRRAD670_quality_control : int 4 4 4 4 4 4 4 4 4 4 ...
## $ geom : chr "POINT (114.98013440372972 -31.802694506260686)" "POINT (114.98013265657173 -31.802693721740056)" "POINT (114.98013155928385 -31.802693229028474)" "POINT (114.98012962223838 -31.802692359243295)" ...
attach(gliders)
#install.packages("visdat")
library("visdat")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("gamlss")
library(gamlss)
## Loading required package: splines
## Loading required package: gamlss.data
##
## Attaching package: 'gamlss.data'
## The following object is masked from 'package:datasets':
##
## sleep
## Loading required package: gamlss.dist
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: parallel
## ********** GAMLSS Version 5.2-0 **********
## For more on GAMLSS look at https://www.gamlss.com/
## Type gamlssNews() to see new features/changes/bug fixes.
library(gamlss.dist)
#install.packages("gamlss.add")
library(mgcv)
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
library(gamlss.add)
## Loading required package: nnet
##
## Attaching package: 'nnet'
## The following object is masked from 'package:mgcv':
##
## multinom
## Loading required package: rpart
library("MASS")
#install.packages("goftest")
library("goftest")
library(fitdistrplus)
## Loading required package: survival
library("corrplot")
## corrplot 0.84 loaded
Missing data distribution
gliders%>%
sample_n(100000) %>%
vis_miss(warn_large_data = FALSE)
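As a numeric complement to the plot, the share of missing values per column can be tabulated directly. The sketch below is a minimal base R illustration on a small made-up data frame (`toy` and `missing_share` are hypothetical names, not part of the analysis); the same call should apply to `gliders`.

```r
# Toy data standing in for the glider table (values are invented).
toy <- data.frame(
  TEMP = c(23.8, NA, 23.7, 23.9),
  NTRA = c(NA, NA, NA, NA),
  CPHL = c(0.25, 0.26, NA, 0.26)
)

# Fraction of missing values in each column, sorted from most to least missing.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(missing_share, 2))
```

Columns that are entirely missing (such as NTRA in the `str` output above) show up with a share of 1.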
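We check the data and remove invalid values. Note that indexing a data frame with a logical vector that contains NA returns an all-NA row for each NA, so rows with a missing PSAL, CPHL, CDOM, VBSC, or IRRAD555 value stay in gliders_valid as all-NA rows. This appears to be why the summaries further down report 1514906 NA's for most variables.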
#library(dplyr)
#count(PSAL >= 2 & PSAL <= 41)
gliders_valid<-gliders[(gliders$PSAL >= 2 & gliders$PSAL <= 41),]
#count(gliders_valid$CPHL>= 0&gliders_valid$CPHL<=100)
gliders_valid<-gliders_valid[(gliders_valid$CPHL>= 0&gliders_valid$CPHL<=100),]
#count(gliders_valid$CDOM >=0 & gliders_valid$CDOM <= 400)
gliders_valid<-gliders_valid[(gliders_valid$CDOM>=0& gliders_valid$CDOM <= 400),]
#count(gliders_valid$VBSC>=0& gliders_valid$VBSC <= 0.1)
gliders_valid<-gliders_valid[(gliders_valid$VBSC>=0 & gliders_valid$VBSC <= 0.1),]
#count(gliders_valid$IRRAD555>=0& gliders_valid$IRRAD555 <= 1000)
gliders_valid<-gliders_valid[(gliders_valid$IRRAD555>=0& gliders_valid$IRRAD555 <= 1000),]
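The range filters above keep a placeholder row whenever a comparison evaluates to NA. A possible NA-safe variant is sketched below on invented data; `in_range`, `toy`, and `toy_valid` are hypothetical names, and the bounds simply mirror the ones used for PSAL and CPHL above.

```r
# Hypothetical helper: TRUE only when x is present and lies in [lo, hi].
in_range <- function(x, lo, hi) !is.na(x) & x >= lo & x <= hi

# Toy data (invented values); the bounds mirror those used above.
toy <- data.frame(
  PSAL = c(35.3, NA, 50.0, 35.1),
  CPHL = c(0.25, 0.30, 0.20, NA)
)

# Combine the checks into one logical vector that never contains NA.
keep <- in_range(toy$PSAL, 2, 41) & in_range(toy$CPHL, 0, 100)
toy_valid <- toy[keep, ]
print(toy_valid)
```

With a filter of this kind the result holds only rows that pass every check, so row counts and NA counts would differ from the outputs recorded in this document.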
We look at the response distribution. However, it does not appear to follow a normal, gamma, log-normal, logistic, or beta distribution, since the p-value of each goodness-of-fit test is small. We therefore decide to choose the distribution during modeling, where we will compare the MSE and other accuracy measures of the model under different distributions.
plotdist(gliders_valid$CPHL)
We first check whether it follows a gamma distribution; the density plot seems to agree.
#install.packages("fitdistrplus")
library("fitdistrplus")
library(ggplot2)
summary(gliders_valid$CPHL)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.4 0.5 0.6 0.7 15.9 1514906
plot(density(gliders_valid$CPHL, na.rm = TRUE), xlim = c(0,2))
CPHL_positive <- gliders_valid[(gliders_valid$CPHL >0),]
CPHL_positive_value <- CPHL_positive$CPHL
fit_CPHL<-fitdistr(na.omit(CPHL_positive_value), "gamma")
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
rand.gamma <- rgamma(100000,
shape = fit_CPHL$estimate[1],
rate = fit_CPHL$estimate[2])
lines(density(rand.gamma),col="red")
CPHL_value<-na.omit(CPHL_positive_value)
CPHL_vector<-c(CPHL_value)
However, for the gamma distribution the p-value is too small, so we reject the hypothesis that CPHL follows a gamma distribution.
fit_CPHL<-fitdist(CPHL_vector, "gamma")
plot(fit_CPHL)
cvm.test(na.omit(CPHL_positive_value),"pgamma",shape = fit_CPHL$estimate[1],rate = fit_CPHL$estimate[2])
##
## Cramer-von Mises test of goodness-of-fit
## Null hypothesis: Gamma distribution
## with parameters shape = 3.14158982307635, rate = 5.5431097486641
## Parameters assumed to be fixed
##
## data: na.omit(CPHL_positive_value)
## omega2 = 670.62, p-value < 2.2e-16
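One way to see where the gamma fit falls short is to compare the moments implied by the fitted parameters with the sample statistics. The sketch below uses the shape and rate printed in the test output above together with the standard gamma moment formulas; the sample values are copied from the `descdist` summary in this document, assuming that summary reports non-excess kurtosis (normal = 3). Variable names are illustrative.

```r
# Parameters reported in the Cramer-von Mises output above.
shape <- 3.14158982307635
rate  <- 5.5431097486641

# Moments of a gamma(shape, rate) distribution.
fitted_moments <- c(
  mean     = shape / rate,
  sd       = sqrt(shape) / rate,
  skewness = 2 / sqrt(shape),
  kurtosis = 3 + 6 / shape   # non-excess kurtosis (normal = 3)
)

# Sample statistics copied from the descdist summary in this document.
sample_moments <- c(mean = 0.5667723, sd = 0.310692,
                    skewness = 2.51739, kurtosis = 48.08185)

print(round(rbind(fitted = fitted_moments, sample = sample_moments), 3))
```

The mean and standard deviation agree closely, while the sample skewness and kurtosis are much larger than the gamma values, which suggests a heavier right tail than the gamma fit allows.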
This diagram (a Cullen and Frey graph) indicates which distributions are candidates for the response variable. The observation and the bootstrapped values mostly land at the bottom of the plot, so we cannot reach a final decision from the diagram alone.
descdist(CPHL_vector,boot = 100, discrete = FALSE)
## summary statistics
## ------
## min: 1e-04 max: 15.9128
## median: 0.5122
## mean: 0.5667723
## estimated sd: 0.310692
## estimated skewness: 2.51739
## estimated kurtosis: 48.08185
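The p-value is too small, so we reject the beta distribution. Note that the beta parameters are fitted to CPHL/100 while the test is applied to the unscaled values, so the statistic should be interpreted with caution.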
fit_CPHL_beta<-fitdist(CPHL_vector/100, "beta")
#plot(fit_CPHL_beta)
cvm.test(na.omit(CPHL_positive_value),"pbeta",shape1 = fit_CPHL_beta$estimate[1],shape2 = fit_CPHL_beta$estimate[2])
##
## Cramer-von Mises test of goodness-of-fit
## Null hypothesis: beta distribution
## with parameters shape1 = 3.12650554093763, shape2 = 548.647476092144
## Parameters assumed to be fixed
##
## data: na.omit(CPHL_positive_value)
## omega2 = 527344, p-value < 2.2e-16
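The p-value is too small, so we reject the normal distribution. Note that ks.test with "pnorm" and no further arguments compares against the standard normal (mean 0, sd 1), not a normal distribution fitted to the data.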
ks.test(CPHL_vector,"pnorm")
## Warning in ks.test(CPHL_vector, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: CPHL_vector
## D = 0.51268, p-value < 2.2e-16
## alternative hypothesis: two-sided
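The p-value is too small, so we reject the log-normal distribution (again tested with the default parameters, meanlog 0 and sdlog 1). The same holds for the logistic distribution.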
ks.test(CPHL_vector,"plnorm")
## Warning in ks.test(CPHL_vector, "plnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: CPHL_vector
## D = 0.4169, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(CPHL_vector,"plogis")
## Warning in ks.test(CPHL_vector, "plogis"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: CPHL_vector
## D = 0.50305, p-value < 2.2e-16
## alternative hypothesis: two-sided
#fit <- fitDist(CPHL_vector, k = 2, type = "realplus", trace = FALSE, try.gamlss = TRUE)
#summary(fit)
Next, we calculate the correlation matrix and draw a correlation plot.
cor_gliders<-gliders_valid[,c(10,12,14,16,18,20,22,24,26,28,30,32,34,36,44,50,52,54,56)]
cor_gliders<-na.omit(cor_gliders)
cor(cor_gliders)
## LATITUDE LONGITUDE PRES DEPTH PROFILE
## LATITUDE 1.00000000 -0.912478864 -0.121970430 -0.121534554 0.557930174
## LONGITUDE -0.91247886 1.000000000 0.108698255 0.108329000 -0.652572700
## PRES -0.12197043 0.108698255 1.000000000 0.999999844 -0.008701475
## DEPTH -0.12153455 0.108329000 0.999999844 1.000000000 -0.008467876
## PROFILE 0.55793017 -0.652572700 -0.008701475 -0.008467876 1.000000000
## PHASE 0.05035957 -0.054431385 0.028164884 0.028188523 0.034240325
## TEMP 0.84783867 -0.908898705 -0.154921418 -0.154568119 0.794417359
## PSAL -0.06272787 0.415782393 0.131589931 0.131624916 -0.257238698
## DOX1 -0.86853465 0.819826205 -0.080381377 -0.080765062 -0.401137404
## DOX2 -0.87696424 0.823745572 -0.055059349 -0.055445942 -0.403082804
## CPHL -0.10062015 0.013705696 -0.095432061 -0.095433453 -0.084960865
## CDOM -0.06980283 0.110778218 0.053106574 0.053105762 -0.052997403
## CNDC 0.85484999 -0.885696031 -0.139113213 -0.138751091 0.790885100
## VBSC 0.15764057 -0.220949160 -0.022081527 -0.022002782 0.182499741
## HEAD -0.11255137 0.173964611 -0.060476018 -0.060552838 -0.268255714
## IRRAD443 0.03539622 0.011403635 -0.305939677 -0.305955245 -0.010751948
## IRRAD490 0.03359733 0.011762414 -0.308475197 -0.308488954 -0.013329206
## IRRAD555 0.02067306 0.001297842 -0.290644264 -0.290667429 -0.014633509
## IRRAD670 0.01084460 -0.009899544 -0.142553594 -0.142570719 -0.018869961
## PHASE TEMP PSAL DOX1 DOX2
## LATITUDE 0.050359573 0.84783867 -0.06272787 -0.86853465 -0.87696424
## LONGITUDE -0.054431385 -0.90889871 0.41578239 0.81982621 0.82374557
## PRES 0.028164884 -0.15492142 0.13158993 -0.08038138 -0.05505935
## DEPTH 0.028188523 -0.15456812 0.13162492 -0.08076506 -0.05544594
## PROFILE 0.034240325 0.79441736 -0.25723870 -0.40113740 -0.40308280
## PHASE 1.000000000 0.04378828 -0.02585608 -0.06598361 -0.06288317
## TEMP 0.043788281 1.00000000 -0.27582846 -0.64556840 -0.65821875
## PSAL -0.025856076 -0.27582846 1.00000000 0.13688353 0.12220988
## DOX1 -0.065983608 -0.64556840 0.13688353 1.00000000 0.99420059
## DOX2 -0.062883168 -0.65821875 0.12220988 0.99420059 1.00000000
## CPHL 0.017197357 -0.10380024 -0.18050703 0.03483310 0.04364376
## CDOM 0.006706789 -0.08867170 0.10085210 0.07748184 0.08168960
## CNDC 0.041693575 0.99581222 -0.18789887 -0.63996695 -0.65401560
## VBSC 0.046207014 0.21489991 -0.16728267 -0.10022371 -0.08418186
## HEAD -0.039380213 -0.21354917 0.08663419 0.09434946 0.09384144
## IRRAD443 0.015998630 0.03729050 0.07913636 0.13722725 0.12644289
## IRRAD490 0.021095901 0.03377881 0.07307605 0.14105569 0.13064252
## IRRAD555 0.005536004 0.03104157 0.02163915 0.12764225 0.11854730
## IRRAD670 -0.020238699 0.01796749 -0.01379555 0.04145448 0.03654562
## CPHL CDOM CNDC VBSC HEAD
## LATITUDE -0.10062015 -0.069802827 0.85484999 0.157640574 -0.112551366
## LONGITUDE 0.01370570 0.110778218 -0.88569603 -0.220949160 0.173964611
## PRES -0.09543206 0.053106574 -0.13911321 -0.022081527 -0.060476018
## DEPTH -0.09543345 0.053105762 -0.13875109 -0.022002782 -0.060552838
## PROFILE -0.08496087 -0.052997403 0.79088510 0.182499741 -0.268255714
## PHASE 0.01719736 0.006706789 0.04169358 0.046207014 -0.039380213
## TEMP -0.10380024 -0.088671695 0.99581222 0.214899914 -0.213549169
## PSAL -0.18050703 0.100852097 -0.18789887 -0.167282668 0.086634188
## DOX1 0.03483310 0.077481840 -0.63996695 -0.100223714 0.094349460
## DOX2 0.04364376 0.081689603 -0.65401560 -0.084181860 0.093841439
## CPHL 1.00000000 0.148703918 -0.12478743 0.294088672 -0.036607286
## CDOM 0.14870392 1.000000000 -0.08014901 0.222915604 0.008238268
## CNDC -0.12478743 -0.080149008 1.00000000 0.205110388 -0.211483736
## VBSC 0.29408867 0.222915604 0.20511039 1.000000000 -0.057866663
## HEAD -0.03660729 0.008238268 -0.21148374 -0.057866663 1.000000000
## IRRAD443 -0.29121472 -0.011593113 0.04382933 -0.036704461 -0.001917468
## IRRAD490 -0.27350904 -0.010331754 0.03962314 -0.034630445 -0.003736401
## IRRAD555 -0.25021044 -0.010795613 0.03220191 -0.024689586 -0.010992031
## IRRAD670 -0.12444173 -0.005522851 0.01624218 -0.002849081 -0.008791674
## IRRAD443 IRRAD490 IRRAD555 IRRAD670
## LATITUDE 0.035396219 0.033597327 0.020673062 0.010844602
## LONGITUDE 0.011403635 0.011762414 0.001297842 -0.009899544
## PRES -0.305939677 -0.308475197 -0.290644264 -0.142553594
## DEPTH -0.305955245 -0.308488954 -0.290667429 -0.142570719
## PROFILE -0.010751948 -0.013329206 -0.014633509 -0.018869961
## PHASE 0.015998630 0.021095901 0.005536004 -0.020238699
## TEMP 0.037290499 0.033778810 0.031041570 0.017967493
## PSAL 0.079136356 0.073076054 0.021639150 -0.013795553
## DOX1 0.137227253 0.141055692 0.127642250 0.041454478
## DOX2 0.126442885 0.130642520 0.118547299 0.036545624
## CPHL -0.291214717 -0.273509041 -0.250210436 -0.124441731
## CDOM -0.011593113 -0.010331754 -0.010795613 -0.005522851
## CNDC 0.043829330 0.039623137 0.032201906 0.016242177
## VBSC -0.036704461 -0.034630445 -0.024689586 -0.002849081
## HEAD -0.001917468 -0.003736401 -0.010992031 -0.008791674
## IRRAD443 1.000000000 0.995043435 0.958353383 0.534655413
## IRRAD490 0.995043435 1.000000000 0.949908054 0.512148984
## IRRAD555 0.958353383 0.949908054 1.000000000 0.664028520
## IRRAD670 0.534655413 0.512148984 0.664028520 1.000000000
corrplot(corr=cor(cor_gliders),method = "color",tl.col="black")
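The column positions used above (10, 12, ..., 56) correspond to the measurement columns in the `str` output. Selecting by name is a possible alternative that keeps working if the column order changes. The sketch below uses a random toy data frame; `cor_vars`, `toy`, and `toy_cor` are hypothetical names.

```r
# Names of the measurement columns used for the correlation matrix above.
cor_vars <- c("LATITUDE", "LONGITUDE", "PRES", "DEPTH", "PROFILE", "PHASE",
              "TEMP", "PSAL", "DOX1", "DOX2", "CPHL", "CDOM", "CNDC", "VBSC",
              "HEAD", "IRRAD443", "IRRAD490", "IRRAD555", "IRRAD670")

# Toy data frame with those columns plus one QC column (values are random).
set.seed(1)
toy <- as.data.frame(matrix(rnorm(20 * length(cor_vars)),
                            ncol = length(cor_vars)))
names(toy) <- cor_vars
toy$TEMP_quality_control <- 1L

# Select by name, drop incomplete rows, and compute the correlation matrix.
toy_cor <- cor(na.omit(toy[, cor_vars]))
print(dim(toy_cor))
```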
Calculate the percentage of good data and bad data
hist(gliders_valid$TIME_quality_control) # all good data
hist(gliders_valid$LATITUDE_quality_control) # 1.5% good data, rest are Interpolated value
hist(gliders_valid$LONGITUDE_quality_control) # 1.5% good data, rest are Interpolated value
hist(gliders_valid$PRES_quality_control) # all good data
hist(gliders_valid$DEPTH_quality_control)# all good data
hist(gliders_valid$PROFILE_quality_control) # all No QC performed data
hist(gliders_valid$PHASE_quality_control)# all No QC performed data
hist(gliders_valid$TEMP_quality_control)# all good data
hist(gliders_valid$PSAL_quality_control)# all good data
hist(gliders_valid$DOX1_quality_control)# all good data
hist(gliders_valid$DOX2_quality_control)# 81% good data, 11.3% Bad data that are potentially correctable, 8% Missing value
hist(gliders_valid$CPHL_quality_control)# all good data
hist(gliders_valid$CDOM_quality_control)# all good data
hist(gliders_valid$CNDC_quality_control)# all good data
hist(gliders_valid$VBSC_quality_control)# 86% good data, 9% Bad data that are potentially correctable, 5% bad data
hist(gliders_valid$UCUR_quality_control)# missing value
hist(gliders_valid$VCUR_quality_control)# missing value
hist(gliders_valid$HEAD_quality_control)# 99.9% No QC performed data
hist(gliders_valid$UCUR_GPS_quality_control)# missing value
hist(gliders_valid$VCUR_GPS_quality_control)# missing value
hist(gliders_valid$IRRAD443_quality_control) # about 54% good data, 46% bad data
hist(gliders_valid$IRRAD490_quality_control) # about 54% good data, 46% bad data
hist(gliders_valid$IRRAD555_quality_control) # about 54% good data, 46% bad data
hist(gliders_valid$IRRAD670_quality_control) # about 54% good data, 46% bad data
Specific calculation of the percentage of missing, good, and bad data for each variable
table(gliders_valid$LATITUDE_quality_control)
##
## 1 8 9
## 23701 1562349 366
23701/(23701+1562349+366)
## [1] 0.01493997
366/(23701+1562349+366)
## [1] 0.0002307087
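The manual divisions above can be done for all flags at once. The sketch below reuses the LATITUDE counts printed above; on the real column, `prop.table(table(x))` would give the same shares. `lat_qc` and `lat_qc_share` are illustrative names.

```r
# Counts copied from table(gliders_valid$LATITUDE_quality_control) above.
lat_qc <- c("1" = 23701, "8" = 1562349, "9" = 366)

# Share of each QC flag among all flagged rows.
lat_qc_share <- lat_qc / sum(lat_qc)
print(round(100 * lat_qc_share, 2))
```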
table(gliders_valid$LONGITUDE_quality_control)
##
## 1 8 9
## 23701 1562349 366
table(gliders_valid$PRES_quality_control)
##
## 1
## 1586416
table(gliders_valid$PROFILE_quality_control)
##
## 0 9
## 1586399 17
table(gliders_valid$PHASE_quality_control)
##
## 0 9
## 1586399 17
table(gliders_valid$DOX2_quality_control)# mostly good data
##
## 0 1 3 4 9
## 1930 1277617 179281 703 126885
126885/(1930+1277617+179281+703+126885)
## [1] 0.07998217
table(gliders_valid$VBSC_quality_control)# mostly good data
##
## 0 1 3 4
## 2230 1368673 139038 76475
76475/(2230+1368673+139038+76475)
## [1] 0.04820615
table(gliders_valid$UCUR_quality_control)
##
## 0 9
## 777 1585639
table(gliders_valid$VCUR_quality_control)
##
## 0 9
## 777 1585639
table(gliders_valid$HEAD_quality_control)
##
## 0 9
## 1584749 1667
1584749/(1584749+1667)
## [1] 0.9989492
table(gliders_valid$UCUR_GPS_quality_control)
##
## 0 9
## 777 1585639
table(gliders_valid$VCUR_GPS_quality_control)
##
## 0 9
## 777 1585639
table(gliders_valid$IRRAD443_quality_control)
##
## 1 4
## 863269 723147
table(gliders_valid$IRRAD490_quality_control)
##
## 1 4
## 863257 723159
table(gliders_valid$IRRAD555_quality_control)
##
## 1 4
## 863317 723099
table(gliders_valid$IRRAD670_quality_control)
##
## 1 4
## 862798 723618
The dataset contains one platform type and four different coverage periods.
table(gliders_valid$platform_type)
##
## slocum glider
## 1586416
table(gliders_valid$time_coverage_start)
##
## 2013-02-15T03:13:29Z 2013-10-31T01:16:21Z 2014-08-08T02:48:06Z
## 202337 539721 685764
## 2014-10-17T00:40:46Z
## 158594
table(gliders_valid$time_coverage_end)
##
## 2013-03-11T20:14:20Z 2013-11-13T05:44:04Z 2014-08-24T22:39:08Z
## 202337 539721 685764
## 2014-11-06T22:18:12Z
## 158594
NTRA_quality_control contains no values.
There are four different deployments in the dataset, and the number of observations varies by deployment. StormBay20141017 has the fewest and TwoRocks20140808 has the most.
table(gliders_valid$NTRA_quality_control)
## < table of extent 0 >
table(gliders_valid$deployment_name)
##
## SpencerGulf20131031 StormBay20141017 TwoRocks20130215 TwoRocks20140808
## 539721 158594 202337 685764
We look for potential relationships between the response and each variable by fitting a cubic smoothing spline to the data (a cubic smoothing spline is similar to a line of best fit). CPHL vs TEMP: the peak is at around 20 degrees. CPHL starts to decrease when the temperature exceeds 20, and at around 21 it tends to 0. PRES: the points are spread evenly, with no strong relationship visible.
mySpline<-smooth.spline(na.omit(gliders_valid$TEMP),na.omit(gliders_valid$CPHL))
plot(gliders_valid$TEMP,gliders_valid$CPHL)
lines(mySpline$x, mySpline$y, col="red", lwd = 2)
myPres<-smooth.spline(na.omit(gliders_valid$PRES),na.omit(gliders_valid$CPHL))
plot(gliders_valid$PRES,gliders_valid$CPHL)
lines(myPres$x, myPres$y, col="red", lwd = 2)
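Applying `na.omit` to each vector separately only keeps x and y aligned when both are missing in exactly the same positions. A small helper that drops incomplete pairs first is sketched below as one possible way to avoid this; `fit_spline` is a hypothetical name, and the data are simulated.

```r
# Hypothetical helper: fit a smoothing spline using only positions where
# both x and y are present, so the two vectors stay aligned.
fit_spline <- function(x, y) {
  ok <- complete.cases(x, y)
  smooth.spline(x[ok], y[ok])
}

# Toy data with missing values in different positions (invented values).
set.seed(1)
x <- seq(0, 10, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.1)
x[5] <- NA
y[20] <- NA

fit <- fit_spline(x, y)
print(length(fit$x))  # number of distinct x values used in the fit
```

The fitted curve can then be drawn with `lines(fit$x, fit$y, col = "red", lwd = 2)`, as in the chunks in this section.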
DEPTH: the points are spread evenly, with no clear relationship visible.
depth<-data.frame(gliders_valid$DEPTH,gliders_valid$CPHL)
depth<-na.omit(depth)
myDepth<-smooth.spline(depth$gliders_valid.DEPTH,depth$gliders_valid.CPHL)
plot(gliders_valid$DEPTH,gliders_valid$CPHL)
lines(myDepth$x, myDepth$y, col="red", lwd = 2)
PSAL: most values lie between 34 and 36, with four observations below 34. Two peaks occur at around 35-36.
myPsal<-smooth.spline(na.omit(gliders_valid$PSAL),na.omit(gliders_valid$CPHL))
plot(gliders_valid$PSAL,gliders_valid$CPHL)
lines(myPsal$x, myPsal$y, col="red", lwd = 2)
plot(gliders_valid$LATITUDE,gliders_valid$CPHL)
plot(gliders_valid$LONGITUDE,gliders_valid$CPHL)
plot(gliders_valid$PROFILE,gliders_valid$CPHL)
PHASE takes only the values 0, 1, 3, and 4.
phase<-data.frame(gliders_valid$PHASE,gliders_valid$CPHL)
phase<-na.omit(phase)
myPhase<-smooth.spline(phase$gliders_valid.PHASE,phase$gliders_valid.CPHL)
plot(gliders_valid$PHASE,gliders_valid$CPHL)
lines(myPhase$x, myPhase$y, col="red", lwd = 2)
myDox1<-smooth.spline(na.omit(gliders_valid$DOX1),na.omit(gliders_valid$CPHL))
plot(gliders_valid$DOX1,gliders_valid$CPHL)
lines(myDox1$x, myDox1$y, col="red", lwd = 2)
DOX2: the peak appears at around 190; CPHL then decreases and reaches about 0 when DOX2 exceeds 220. CDOM: most data lie in the range 0-50; based on the smoothing spline, there is a potential positive relationship between the two.
dox2<-data.frame(gliders_valid$DOX2,gliders_valid$CPHL)
dox2<-na.omit(dox2)
myDox2<-smooth.spline(dox2$gliders_valid.DOX2,dox2$gliders_valid.CPHL)
plot(gliders_valid$DOX2,gliders_valid$CPHL)
lines(myDox2$x, myDox2$y, col="red", lwd = 2)
myCDOM<-smooth.spline(na.omit(gliders_valid$CDOM),na.omit(gliders_valid$CPHL))
plot(gliders_valid$CDOM,gliders_valid$CPHL)
lines(myCDOM$x, myCDOM$y, col="red", lwd = 2)
CNDC: the peak is at around 4.8, and CPHL tends to 0 at 4.2 and 5.0. VBSC: CPHL increases while VBSC is around 0.000-0.006, then drops suddenly at 0.006 and tends to 0. HEAD: the points are spread evenly, but the highest values occur at a heading of around 250. IRRAD443, IRRAD555, and IRRAD490 have similar diagrams and appear to have an inverse relationship with CPHL.
cndc<-data.frame(gliders_valid$CNDC,gliders_valid$CPHL)
cndc<-na.omit(cndc)
mycndc<-smooth.spline(cndc$gliders_valid.CNDC,cndc$gliders_valid.CPHL)
plot(gliders_valid$CNDC,gliders_valid$CPHL)
lines(mycndc$x, mycndc$y, col="red", lwd = 2)
vbsc<-data.frame(gliders_valid$VBSC,gliders_valid$CPHL)
vbsc<-na.omit(vbsc)
myvbsc<-smooth.spline(vbsc$gliders_valid.VBSC,vbsc$gliders_valid.CPHL)
plot(gliders_valid$VBSC,gliders_valid$CPHL)
lines(myvbsc$x, myvbsc$y, col="red", lwd = 2)
# Named head_df rather than head to avoid shadowing base::head
head_df<-data.frame(gliders_valid$HEAD,gliders_valid$CPHL)
head_df<-na.omit(head_df)
myhead<-smooth.spline(head_df$gliders_valid.HEAD,head_df$gliders_valid.CPHL)
plot(gliders_valid$HEAD,gliders_valid$CPHL)
lines(myhead$x, myhead$y, col="red", lwd = 2)
irrd<-data.frame(gliders_valid$IRRAD443,gliders_valid$CPHL)
irrd<-na.omit(irrd)
myirrd<-smooth.spline(irrd$gliders_valid.IRRAD443,irrd$gliders_valid.CPHL)
plot(gliders_valid$IRRAD443,gliders_valid$CPHL)
lines(myirrd$x, myirrd$y, col="red", lwd = 2)
irrad<-data.frame(gliders_valid$IRRAD555,gliders_valid$CPHL)
irrad<-na.omit(irrad)
myirrad<-smooth.spline(irrad$gliders_valid.IRRAD555,irrad$gliders_valid.CPHL)
plot(gliders_valid$IRRAD555,gliders_valid$CPHL)
lines(myirrad$x, myirrad$y, col="red", lwd = 2)
irad<-data.frame(gliders_valid$IRRAD490,gliders_valid$CPHL)
irad<-na.omit(irad)
myirad<-smooth.spline(irad$gliders_valid.IRRAD490,irad$gliders_valid.CPHL)
plot(gliders_valid$IRRAD490,gliders_valid$CPHL)
lines(myirad$x, myirad$y, col="red", lwd = 2)
Understand the dataset
summary(gliders_valid)
## FID file_id deployment_name platform_type
## Length:3101322 Min. :185.0 Length:3101322 Length:3101322
## Class :character 1st Qu.:188.0 Class :character Class :character
## Mode :character Median :189.0 Mode :character Mode :character
## Mean :188.2
## 3rd Qu.:189.0
## Max. :190.0
## NA's :1514906
## platform_code time_coverage_start time_coverage_end TIME
## Length:3101322 Length:3101322 Length:3101322 Length:3101322
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TIME_quality_control LATITUDE LATITUDE_quality_control
## Min. :1 Min. :-43.7 Min. :1.0
## 1st Qu.:1 1st Qu.:-35.4 1st Qu.:8.0
## Median :1 Median :-32.1 Median :8.0
## Mean :1 Mean :-34.1 Mean :7.9
## 3rd Qu.:1 3rd Qu.:-31.6 3rd Qu.:8.0
## Max. :4 Max. :-31.5 Max. :9.0
## NA's :1514906 NA's :1515272 NA's :1514906
## LONGITUDE LONGITUDE_quality_control PRES
## Min. :115.0 Min. :1.0 Min. : 0.0
## 1st Qu.:115.3 1st Qu.:8.0 1st Qu.: 15.8
## Median :115.5 Median :8.0 Median : 33.9
## Mean :125.6 Mean :7.9 Mean : 44.0
## 3rd Qu.:136.0 3rd Qu.:8.0 3rd Qu.: 65.7
## Max. :147.8 Max. :9.0 Max. :198.9
## NA's :1515272 NA's :1514906 NA's :1514906
## PRES_quality_control DEPTH DEPTH_quality_control PROFILE
## Min. :1 Min. : 0.0 Min. :1 Min. : 0
## 1st Qu.:1 1st Qu.: 15.7 1st Qu.:1 1st Qu.: 662
## Median :1 Median : 33.7 Median :1 Median : 1557
## Mean :1 Mean : 43.6 Mean :1 Mean : 2315
## 3rd Qu.:1 3rd Qu.: 65.3 3rd Qu.:1 3rd Qu.: 3928
## Max. :1 Max. :197.5 Max. :9 Max. :15257
## NA's :1514906 NA's :1515244 NA's :1514906 NA's :1514923
## PROFILE_quality_control PHASE PHASE_quality_control
## Min. :0 Min. :0.0 Min. :0
## 1st Qu.:0 1st Qu.:1.0 1st Qu.:0
## Median :0 Median :1.0 Median :0
## Mean :0 Mean :2.4 Mean :0
## 3rd Qu.:0 3rd Qu.:4.0 3rd Qu.:0
## Max. :9 Max. :4.0 Max. :9
## NA's :1514906 NA's :1514923 NA's :1514906
## TEMP TEMP_quality_control PSAL PSAL_quality_control
## Min. :12.6 Min. :0 Min. :32.4 Min. :0
## 1st Qu.:16.1 1st Qu.:1 1st Qu.:35.2 1st Qu.:1
## Median :18.9 Median :1 Median :35.3 Median :1
## Mean :18.2 Mean :1 Mean :35.4 Mean :1
## 3rd Qu.:20.3 3rd Qu.:1 3rd Qu.:35.7 3rd Qu.:1
## Max. :24.1 Max. :1 Max. :36.2 Max. :4
## NA's :1514906 NA's :1514906 NA's :1514906 NA's :1514906
## DOX1 DOX1_quality_control DOX2 DOX2_quality_control
## Min. :178.1 Min. :0 Min. :176.6 Min. :0.0
## 1st Qu.:192.4 1st Qu.:1 1st Qu.:188.2 1st Qu.:1.0
## Median :201.0 Median :1 Median :196.7 Median :1.0
## Mean :206.4 Mean :1 Mean :202.0 Mean :1.9
## 3rd Qu.:216.3 3rd Qu.:1 3rd Qu.:211.0 3rd Qu.:1.0
## Max. :264.2 Max. :1 Max. :258.4 Max. :9.0
## NA's :1514906 NA's :1514906 NA's :1641791 NA's :1514906
## CPHL CPHL_quality_control CDOM CDOM_quality_control
## Min. : 0.0 Min. :0.0 Min. : 0.0 Min. :0.0
## 1st Qu.: 0.4 1st Qu.:1.0 1st Qu.: 0.4 1st Qu.:1.0
## Median : 0.5 Median :1.0 Median : 0.7 Median :1.0
## Mean : 0.6 Mean :1.1 Mean : 0.8 Mean :1.1
## 3rd Qu.: 0.7 3rd Qu.:1.0 3rd Qu.: 1.1 3rd Qu.:1.0
## Max. :15.9 Max. :4.0 Max. :242.8 Max. :4.0
## NA's :1514906 NA's :1514906 NA's :1514906 NA's :1514906
## CNDC CNDC_quality_control VBSC VBSC_quality_control
## Min. :4.0 Min. :0 Min. :0 Min. :0.0
## 1st Qu.:4.5 1st Qu.:1 1st Qu.:0 1st Qu.:1.0
## Median :4.7 Median :1 Median :0 Median :1.0
## Mean :4.7 Mean :1 Mean :0 Mean :1.3
## 3rd Qu.:4.9 3rd Qu.:1 3rd Qu.:0 3rd Qu.:1.0
## Max. :5.3 Max. :4 Max. :0 Max. :4.0
## NA's :1514906 NA's :1514906 NA's :1514906 NA's :1514906
## NTRA NTRA_quality_control UCUR UCUR_quality_control
## Min. : NA Min. : NA Min. :-0.3 Min. :0
## 1st Qu.: NA 1st Qu.: NA 1st Qu.:-0.1 1st Qu.:9
## Median : NA Median : NA Median : 0.0 Median :9
## Mean :NaN Mean :NaN Mean : 0.0 Mean :9
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: 0.1 3rd Qu.:9
## Max. : NA Max. : NA Max. : 0.3 Max. :9
## NA's :3101322 NA's :3101322 NA's :3100545 NA's :1514906
## VCUR VCUR_quality_control HEAD HEAD_quality_control
## Min. :-0.5 Min. :0 Min. : 0.0 Min. :0
## 1st Qu.:-0.1 1st Qu.:9 1st Qu.:104.3 1st Qu.:0
## Median : 0.0 Median :9 Median :166.5 Median :0
## Mean : 0.0 Mean :9 Mean :174.3 Mean :0
## 3rd Qu.: 0.1 3rd Qu.:9 3rd Qu.:262.9 3rd Qu.:0
## Max. : 0.4 Max. :9 Max. :359.9 Max. :9
## NA's :3100545 NA's :1514906 NA's :1516573 NA's :1514906
## UCUR_GPS UCUR_GPS_quality_control VCUR_GPS
## Min. :-0.6 Min. :0 Min. :-0.9
## 1st Qu.:-0.2 1st Qu.:9 1st Qu.:-0.1
## Median :-0.1 Median :9 Median : 0.1
## Mean :-0.1 Mean :9 Mean : 0.0
## 3rd Qu.: 0.1 3rd Qu.:9 3rd Qu.: 0.2
## Max. : 0.4 Max. :9 Max. : 0.6
## NA's :3100545 NA's :1514906 NA's :3100545
## VCUR_GPS_quality_control IRRAD443 IRRAD443_quality_control
## Min. :0 Min. : 0.0 Min. :1.0
## 1st Qu.:9 1st Qu.: 0.0 1st Qu.:1.0
## Median :9 Median : 0.0 Median :1.0
## Mean :9 Mean : 6.3 Mean :2.4
## 3rd Qu.:9 3rd Qu.: 3.1 3rd Qu.:4.0
## Max. :9 Max. :321.8 Max. :4.0
## NA's :1514906 NA's :1514906 NA's :1514906
## IRRAD490 IRRAD490_quality_control IRRAD555
## Min. : 0.0 Min. :1.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.:1.0 1st Qu.: 0.0
## Median : 0.0 Median :1.0 Median : 0.0
## Mean : 7.7 Mean :2.4 Mean : 4.5
## 3rd Qu.: 4.8 3rd Qu.:4.0 3rd Qu.: 1.1
## Max. :334.4 Max. :4.0 Max. :331.1
## NA's :1514906 NA's :1514906 NA's :1514906
## IRRAD555_quality_control IRRAD670 IRRAD670_quality_control
## Min. :1.0 Min. : 0.0 Min. :1.0
## 1st Qu.:1.0 1st Qu.: 0.0 1st Qu.:1.0
## Median :1.0 Median : 0.0 Median :1.0
## Mean :2.4 Mean : 1.5 Mean :2.4
## 3rd Qu.:4.0 3rd Qu.: 0.0 3rd Qu.:4.0
## Max. :4.0 Max. :326.0 Max. :4.0
## NA's :1514906 NA's :1514906 NA's :1514906
## geom
## Length:3101322
## Class :character
## Mode :character
##
##
##
##